MEASUREMENT IN ECONOMIC EDUCATION RESEARCH

An understanding of measurement is essential for work in education. For
example, a noted measurement authority has stated:



In today's educational milieu just about 50 percent of the problems
we encounter do, in fact, involve test use, test construction, or
test interpretation. Consequently, just about any kind of
specialist who, lacking knowledge about measurement, goes out to do
battle with today's educational problems is almost certain to come
back a loser. For the present and foreseeable future, educators who
wish to be effective in their work simply must master the major
tenets of educational measurement (W. James Popham, 1981, p. 4).

This recommendation for educators also applies to researchers in economic
education. Knowledge of econometric techniques is not sufficient to do the
work; we also need a firm grasp of measurement principles.

In essence, empirical work in economic education begins with measurement.
We can identify research problems, specify hypotheses, and construct an
elaborate research design, but we must start our empirical studies with
measurement. In fact, many worthwhile research ideas have probably been
abandoned for lack of available instruments to measure important inputs or
outputs. The continuing work also depends on measurement. Statistical tests
are based on comparisons among measures. Conclusions are drawn from the
statistical analysis of the measured data. Strictly speaking, the use of poor
quality instruments or the use of instruments with incomplete technical
information raises doubt about the findings of a study, even if the study is
done carefully in all other respects.

Since measurement is central to the research process in economic education,
this chapter provides a general introduction to the topic. The major cognitive
tests of economic understanding are discussed from the perspective of the
technical characteristics of reliability, validity, and national norms.
Suggestions are also offered for what to do when a standardized economics test
is not available. In addition, affective measures are now widely used in
economic education, but many researchers fail to report information on
reliability or validity of these measures. This measurement problem is examined
in depth and general guidance is given on ways to evaluate new measures.

A caveat is probably in order at this point. The chapter is not designed
as a substitute for the basic coverage of material presented in measurement
texts (cf. Ebel, 1979; Gronlund, 1981; Nunnally, 1978). These texts offer
extensive information on many measurement topics and should serve as a
background reference in the same way researchers use econometric texts. This
chapter only discusses the basic measurement topics of reliability, validity,
and norms as they apply to the major norm-referenced cognitive and affective
instruments in economic education so researchers will have a framework for
judging their technical quality.

The Major Cognitive Tests

The assertion that measurement questions have been completely ignored in
economic education research is not supported by the evidence, especially when
one looks at the major cognitive measures. In a recent literature review,
Weisbrod found that 27 percent of the 106 papers focused on "how to define and
measure outputs" (1979, p. 15), with no doubt most of the studies being of
cognitive measures. Also, more than a decade ago Rendigs Fels recognized that:

Hypothesis-testing requires measurement, quantification, and [...].
One reason there has been little hypothesis testing in economic
education has been the lack, until recently, of objective measuring
instruments (1970, p. 27).

The instruments that Fels was referring to were: the Test of Economic
Understanding (TEU) for high school students; the College Level Examination
Program (CLEP) examination in introductory economics; and the then new Test
of Understanding of College Economics (TUCE).

The TUCE became the standard cognitive instrument in economic education
research, with 62 of the 100 empirical studies in the Journal of Economic
Education from 1969 to mid-1983 reporting its use. The TUCE was recently
revised after a 12-year period and the revision (RTUCE) is available for
further research work (Saunders, 1981). Similarly, the Test of Economic
Literacy (TEL) has replaced the TEU at the high school level (Soper, 1979).
And the new Basic Economics Test (BET) (Chizmar and Halinski, 1980) offers
researchers interested in measuring achievement at the upper elementary level
a test to replace the Test of Elementary Economics.

The RTUCE, TEL, and BET give economic education researchers a set of
nationally normed and standardized instruments at both the college and
pre-college levels. Each measure, however, needs to be carefully analyzed for
information on its reliability, validity, norming, and test item data before it
is used by researchers. Blind use may lead to misuse, and close study of each
instrument can reveal where caution is necessary or where further measurement
work is required. We begin with a discussion of reliability and then turn to
validity, norms, and test item data.

Reliability

Reliability refers to the consistency of measurement, or the capacity of a
test to measure student performance accurately. Any test which contains too
much randomness (error) cannot be used for making comparisons or decisions in
research work. Random errors of measurement are present in any test, so it is
the degree of consistency which is of interest and which we estimate when we
look at reliability. What we search for are instruments that are reliable
measures of student performance over time, over test conditions, and over
samples of items.

Basically, there are four ways to estimate reliability. The first method
provides an estimate of the stability of the test over time. This property is
estimated by the correlation between test scores of the same test given at two
different points in time (usually over a two-week period) without any
intervening treatment or instruction. This test-retest method accounts for:
(1) constancy of student response on the test over time; and (2) the
consistency of the test procedures since two test administrations are necessary.
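A minimal sketch of the test-retest estimate, assuming the two administrations are available as paired lists of scores for the same students. The scores are hypothetical, and `pearson_r` is an illustrative helper rather than a procedure taken from any test manual.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two paired score lists with at least 2 students")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for eight students on the same test, given twice
# about two weeks apart with no intervening instruction.
first_administration = [18, 22, 25, 30, 15, 27, 21, 24]
second_administration = [20, 21, 27, 29, 16, 25, 23, 24]

test_retest_r = pearson_r(first_administration, second_administration)
print(f"test-retest reliability estimate: {test_retest_r:.2f}")
```

The same helper applies to the equivalent-forms method: correlate scores on form A with scores on form B instead of two sittings of one form.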

In theory, a test only contains a sample of all possible items in the
sampling domain. When "parallel" forms of a test are developed, we have two
samples of test items from the test domain. By administering the two parallel
tests to students and correlating their scores we can examine the property of
test equivalence, where we look at how consistent scores are from one sample to
another. This equivalent-forms reliability method takes into account:
(1) the consistency of measurement over different samples of items; and (2) the
consistency of the test procedures since two test administrations are necessary.

When equivalent forms of a test are administered over a time period, then
the property of test equivalence and stability can be estimated by correlating
the test scores from the two test administrations. This combination method
takes into account: (1) the constancy of student response over time; (2) the
consistency of test procedures; and (3) the consistency of the test over
different samples of items. We also refer to this method as equivalent-forms
reliability, but with a time interval.



Finally, we can obtain information on the internal consistency of a test.
Internal consistency indicates whether the items in a test are measuring a
common characteristic, or whether the test is homogeneous. A common approach to
estimating internal consistency is to split the test in half and correlate the
scores from each half, producing a split-half reliability estimate. A more
sophisticated and recommended method to estimate internal consistency is by
Kuder-Richardson 20 (KR-20) or Cronbach alpha formulas. These procedures offer
internal consistency estimates which are essentially an average of all possible
split-half coefficients, and they are popular because they require only one test
administration. Internal consistency estimates account for: (1) the
consistency of the test over different samples of items; and (2) consistency
over test conditions since in theory two test scores are being correlated.
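The internal consistency formulas can be sketched as follows, assuming item responses are scored 0 (wrong) or 1 (right) and stored one row per student. The response matrix is hypothetical and the function names are illustrative. Population variances are used throughout, under which KR-20 and Cronbach's alpha coincide for dichotomous items.

```python
def population_variance(values):
    """Variance with n (not n - 1) in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a students-by-items score matrix."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    item_variances = [
        population_variance([row[j] for row in item_scores]) for j in range(k)
    ]
    return (k / (k - 1)) * (1 - sum(item_variances) / population_variance(totals))


def kr20(item_scores):
    """Kuder-Richardson 20 for items scored 0/1."""
    n = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    # p is the proportion answering each item correctly; p * (1 - p) is
    # the variance of a 0/1 item.
    proportions = [sum(row[j] for row in item_scores) / n for j in range(k)]
    sum_pq = sum(p * (1 - p) for p in proportions)
    return (k / (k - 1)) * (1 - sum_pq / population_variance(totals))


# Hypothetical responses: six students, five items.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
kr20_value = kr20(responses)
print(f"alpha = {alpha:.3f}, KR-20 = {kr20_value:.3f}")
```

Both estimates need only a single administration, which is the practical advantage noted in the text.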

With this basic background on reliability, we can now compare the four
approaches across consistency factors. As shown in Table 1, different
information is provided by different estimates. A test-retest method with a time
interval gives us information on the consistency of the test procedure and
constancy of student response, but tells us nothing about the consistency of the
test over different samples of items. The internal consistency estimates tell us
about the consistency of the test over different samples of items and over test
procedures, but tell us nothing about the constancy of student responses over
time. Only the coefficient estimate of equivalence and stability
(equivalent-forms reliability with a time interval) accounts for all three
consistency considerations, making it the most rigorous reliability test.



Insert Table 1 about here 



Reliability of Economics Achievement Tests

We now have a framework to judge the reliability of the various economics
achievement tests. Only internal consistency estimates are reported for the
BET, TEL, and RTUCE. These estimates offer information on the consistency of
the measure over different samples of items and over test procedure. We have no
information on the constancy of student response over time. In fact, since the
BET, TEL, and RTUCE all have "parallel" forms, it is surprising that no data are
reported to support the equivalence assertion. In this case, one test expert
recommends that "a teacher should look with suspicion on any test that has two
forms available and does not report information concerning their equivalence"
because without evidence "the comparability of the results of the two forms
cannot be assumed" (Gronlund, 1981, p. 98). The probable reason for the
omission was the expense and difficulty of arranging two test administrations
for the large national sample. Yet, a smaller sample study could be offered as
stability and equivalence evidence. So while we have some reliability
information, further measurement work would help us make a more complete
judgment about the reliability of the RTUCE, TEL, and BET.

The internal consistency estimates for the RTUCE, TEL, and BET can still be
useful. The posttest KR-20 estimates for the RTUCE were: .81 for macro form A;
.76 for macro form B; .75 for micro form A; .74 for micro form B; .73 for the
hybrid micro/macro form A; and .71 for the hybrid micro/macro form B. The TEL
showed Cronbach alphas of .87 for each test form. The alphas for the BET were
.83 for form A and .78 for form B.

What do these numbers mean? Reliability is measured on a scale from .00 to
1.00, with 1.00 indicating perfect reliability and .00 indicating no
reliability. Since our estimates are somewhere in between, but over .70, are the
instruments reliable as estimated by internal consistency formulas? The answer
is yes for research purposes. As Nunnally (1982) states:

It is not necessary for the reliability to be as high in instruments 
that are used for research in education or related fields as it is 
for such practical applications as assessing the progress of
students in school . . . In basic research a good working rule is
that the reliability coefficient should be at least .70, but it is
not always necessary to have reliabilities that range into the 90s
(p. 1600).

According to this working standard, none of the instruments should be used for
applied work, where important decisions are made and reliability coefficients
over .90 are necessary. The BET, TEL, and RTUCE do meet the standard for
research work.
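As an illustration of this working rule, the sketch below checks the coefficients reported above against the two thresholds mentioned in the text (.70 for research use, .90 for applied decisions). The dictionary keys and function name are illustrative labels, and treating the thresholds as hard cutoffs is a simplifying assumption.

```python
RESEARCH_MINIMUM = 0.70  # working rule for basic research
APPLIED_MINIMUM = 0.90   # level cited for important applied decisions

# Internal consistency estimates as reported in the text.
reported_reliability = {
    "RTUCE macro A": 0.81,
    "RTUCE macro B": 0.76,
    "RTUCE micro A": 0.75,
    "RTUCE micro B": 0.74,
    "RTUCE hybrid A": 0.73,
    "RTUCE hybrid B": 0.71,
    "TEL A": 0.87,
    "TEL B": 0.87,
    "BET A": 0.83,
    "BET B": 0.78,
}


def classify(coefficient):
    """Label a reliability coefficient by the uses it would support."""
    if coefficient >= APPLIED_MINIMUM:
        return "research and applied"
    if coefficient >= RESEARCH_MINIMUM:
        return "research only"
    return "below research minimum"


classification = {
    form: classify(value) for form, value in reported_reliability.items()
}
for form, label in classification.items():
    print(f"{form}: {reported_reliability[form]:.2f} -> {label}")
```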



On many occasions researchers want to study the difference in performance
from pretest to posttest on the RTUCE, TEL, or BET, where the difference score
is considered to be a measure of value-added for an instructional unit or an
experiment. Since difference scores are calculated from two fallible tests,
difference scores will be an imperfect measure of change. The reliability of
difference scores is lower than the average reliability of the two tests from
which the difference score is calculated.



A formula for calculating the reliability of difference scores is:

$$ r_{DD} = \frac{r_{AA} + r_{BB} - 2\,r_{AB}}{2 - 2\,r_{AB}} $$

where $r_{DD}$ is the reliability of the difference between test A (pretest)
and test B (posttest); $r_{AA}$ is the reliability of test A; $r_{BB}$, the
reliability of test B; and $r_{AB}$ is the correlation between pre- and
posttest. If test A and test B have the same reliability (.7) and the pre-post
correlation is .7, then the reliability of the difference becomes .00. An
increase in the reliability of test A and B to .8 only increases the reliability
of the difference to .33. Another way to increase the reliability of the
difference score is to decrease the correlation between the pretest and
posttest, but this raises questions about test validity (cf. Brown, 1970).
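The two worked cases in the text can be checked with a short sketch. It assumes the common form of the formula in which pretest and posttest have equal variances, (r_AA + r_BB - 2 r_AB) / (2 - 2 r_AB), which reproduces the .00 and .33 values quoted above. The function name is illustrative.

```python
def difference_score_reliability(r_aa, r_bb, r_ab):
    """Reliability of a posttest-minus-pretest difference score.

    r_aa, r_bb: reliabilities of the pretest and posttest.
    r_ab: correlation between pretest and posttest scores.
    Assumes the two tests have equal score variances.
    """
    if r_ab >= 1.0:
        raise ValueError("pre-post correlation must be below 1.0")
    return (r_aa + r_bb - 2 * r_ab) / (2 - 2 * r_ab)


# Both tests at .7 reliability with a pre-post correlation of .7.
low_case = difference_score_reliability(0.7, 0.7, 0.7)

# Raising both reliabilities to .8 while the correlation stays at .7.
higher_case = difference_score_reliability(0.8, 0.8, 0.7)

print(f"{low_case:.2f} {higher_case:.2f}")
```

Lowering `r_ab` raises the result, which is the second route mentioned in the text, but a low pre-post correlation would suggest the two tests are not measuring the same thing.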



The low reliability of difference scores usually makes them an inconsistent
and risky form of measurement to use for either research comparisons or
important decisions about student performance. Unfortunately, there appears to
be no recognition in the economic education literature that difference scores
are basically unreliable measures, and to date few researchers have examined
this measurement problem. While a value-added measure may be useful for most
traditional areas of economic research, when calculated using student test
scores, it is probably a measure with low reliability.

One final point needs to be mentioned before we turn to other matters.
Reliability is a necessary, but not a sufficient, condition for test validity.
We can produce an instrument which has great internal consistency and good
stability. If, however, a test does not measure the property that we wish it to
measure, then the consistency or reliability of the measure is of little value.
A test with a high reliability estimate does not mean that the test possesses
high validity. A complete look at the RTUCE, TEL, and BET requires an inspection
of the validity of these tests.

Validity 

The most important characteristic of a test is its validity, or the extent
to which it measures what it is designed to measure. Validity is not a property
that the instrument possesses, but is specific to the situation for which the
instrument is intended to be used. The RTUCE, for instance, may be a valid
measure of introductory college economics; it is not a valid measure of
introductory college mathematics. Validity is also based on the "soundness of
the interpretation" of the test results for a particular group of individuals,
and only the interpretation of the test data has validity, not the test
instrument. Validity, therefore, is determined in the context of the situation
where the instrument is used and the interpretation of the results produced.
While we may use the terms "test validity" or the "validity of a test," the
above qualifications should be remembered when studying the validity of the
RTUCE, TEL, and BET.

As was the case with reliability, there are several types of validity for a
test. Content validity refers to the degree to which items on a test adequately
represent a sample of the content area under study. Content validity is the
most important type of validity to consider for most achievement tests because
of the focus on subject matter. Criterion-related validity involves determining
how well performance on the test correlates with performance on another
"criterion" measure. When we correlate performance on the instrument and a
criterion measure at the same point in time, we refer to this criterion-related
validity as concurrent validity. When we use the performance on the test
instrument to predict performance on a criterion measure given at a later point
in time, we are working with a form of criterion-related validity called
predictive validity. Finally, there is construct validity, which includes
methods for obtaining evidence on how well an instrument measures an
unobservable "construct" such as "economic understanding." Construct validity is
probably the most comprehensive of the three types of validity, but proper test
validation may require information on all types. We will examine the RTUCE,
TEL, and BET from the various validity perspectives.

Content Validity

The establishment of content validity is most important for achievement
test development and is probably the strongest feature in the development of the
RTUCE, TEL, and BET. Each test contains a test specification matrix which
includes information on the content domain covered by the test. The TEL and BET
were developed using the content framework in the Master Curriculum Guide
(Hansen, et al., 1977) to identify the content areas which should be covered by
the test. (With the BET this list was modified somewhat since certain concepts
listed in the MCG are not even taught to elementary children.) For the TEL and
BET, a working committee wrote the test questions, pretested the items, and
received feedback from a national advisory committee. Similar procedures were
followed with the development of the RTUCE, except that no handy written
framework was available to specify concepts taught in the "typical" introductory
college economics course. The test content, however, closely parallels most of
the basic concepts covered in a standard micro- or macroeconomics principles
text.

Test questions were also developed and categorized according to a cognitive
level classification. The BET and TEL cognitive classifications were based on a
modified form of the widely used cognitive taxonomic system developed by Bloom
(1965). The RTUCE used a classification system defined by the test developers,
which categorized questions as recognition and understanding, implicit
application, and explicit application. Why an ad hoc specification was used for
the RTUCE rather than a more widely accepted one, such as Bloom's taxonomy,
remains a mystery and may be a weakness in this test. (We will return to this
point when we discuss construct validity.)

The content-cognitive test specification matrix for Micro form A of the
RTUCE is provided in Table 2 for illustrative purposes. Establishing content
validity is a complex process involving rational judgments by experts and is not
simply a statistical calculation. The potential problems with the content
validity of the RTUCE, or for the TEL or BET, involve the appropriateness and
weighting of the content-cognitive matrix. Criticism could be directed at test
committee judgment on the selection of concepts and the cognitive level at which
they are tested for the groups of introductory economics students under study.



Insert Table 2 about here



For example, the RTUCE can be viewed by some instructors as not being
representative of the content covered in their particular introductory
courses. Since the original TUCE faced similar criticism, it may be worth
reemphasizing a point made in the original TUCE manual:

Whether the TUCE is a valid test depends on the purposes for which
it is used. Some economics instructors will no doubt disagree with
the content or objectives emphasized by the test committee. For
these instructors, TUCE will not possess content validity (Fels, et
al., p. 15).



This point also applies to users of the BET, TEL, and RTUCE. Researchers need
to make certain that the instrument is appropriate for the situation under
investigation. The BET, TEL, and RTUCE are general achievement measures, and
when we wish to make comparisons across courses on general achievement in
economics, these measures are quite valid to use. In a research investigation,
there may be differences in what is emphasized in one course over another, but
if the differences are slight, and if comparisons are to be made, then a
standardized measure is still appropriate. On the other hand, the RTUCE is not
a valid measure to use for grading or evaluation purposes in a course where the
course content differs substantially from the content coverage of the RTUCE.



Criterion-related Validity

Criterion-related validity is often determined by correlating performance
on an economics test with a "criterion" instrument. The major problem, of
course, with this type of validity is the selection of an appropriate criterion
instrument. The better the instrument, the stronger the validity evidence. This
procedure also gives researchers an empirical method for supporting validity
claims rather than the judgmental approach of content validity.

None of the new economics tests provide any criterion-related evidence to
support validity claims, possibly because no suitable criterion could be found
at the time the test was constructed. A few suggestions, however, for future
work could be made. RTUCE scores could be correlated with scores on the CLEP
economics test. Scores on the TEL could be correlated with scores on the hybrid
RTUCE. Large national samples may not be available for this concurrent validity
work, but some small sample studies might offer new validity evidence.

What might be of even more interest would be to use the instruments for
predictive validity work. Are scores by high school seniors on the TEL useful
for predicting performance, either on grades or on the RTUCE, in the
introductory economics course taken a year later? Does the RTUCE have any
predictive validity for later performance in upper level economics courses? Can
the BET be used to predict student performance in economics in junior high
school? These questions suggest areas for future predictive validity work with
the BET, TEL, and RTUCE. Information on the predictive power of our measures,
while controlling for background variables, may help guide curriculum work in
economics.

Construct Validity

"Economic understanding" is essentially an unobservable construct which we
wish to measure, and so we must consider construct validity as well as other
types of validity. Several methods are used to establish construct validity.
First, we can make predictions about how certain groups will perform on the
measure and then test the groups and compare performances. We might predict,
for example, that high school students would score lower on the RTUCE than
college students who have completed the introductory course, but that graduate
students in economics would outperform both groups. Some limited evidence of
this type is provided by the TEL since there is a statistically significant
difference in test scores for groups of students classified with and without
economics training. A statistically significant difference was also reported
with groups who took the RTUCE as a pretest and groups who took the RTUCE as a
posttest in an introductory economics course.

In addition, we can examine construct validity by making predictions about
the effects of intervention or treatment. If test scores respond to
instruction, then this finding is evidence to support construct validity. The
best example of this evidence is provided with the BET. An analysis of variance
was conducted examining BET test scores while controlling for grade level, sex,
instruction in economics, and a sex interaction variable. The analysis showed
that student scores in economics increased with the amount of economics
instruction received. What is unique about this analysis for the BET is the
attempt to control for background variables. More work may be needed here,
however, since no measure of general ability or reading was included in the
model and grade level may not be an adequate proxy for these factors, or other
factors might be especially important (e.g., socioeconomic status).

Obtaining correlations with other measures is a third way to support
construct validity claims. Probably the most important construct validity work
to be done with all three tests is to support the cognitive level claims for
test questions. Assertions are being made that parts of each test are assessing
higher level cognitive skills, but we have no evidence. An earlier study which
sought to address this type of problem was conducted by Lewis and Dahl (1971)
for the old TUCE. In essence, this study looked at the correlations among the
TUCE or its subparts, and other measures, such as the ACT test or a critical
thinking test, to assess the cognitive level construct validity of the TUCE.
This topic is ripe for further work with the new economics tests, and the Lewis
and Dahl study offers a useful starting point.



An important characteristic of a standardized economics achievement test is
the availability of national norms. To be certain that norming data is useable
for comparison purposes, both the quantity and quality of the comparative data
need to be judged. The usual criteria to look for are: (1) the norming sample
size; (2) the recentness of the data collection; (3) the representativeness of
the sample; and (4) complete description of the test procedures.

An examination of the RTUCE, TEL, and BET indicates that the instruments
meet all these criteria. The RTUCE data were collected from over 7,000 college
students taking introductory economics courses in 24 colleges and universities
of various sizes across the United States in the spring term of 1979. TEL
norming data were obtained from over 8,500 eleventh and twelfth grade students
in 92 high school classrooms in 36 states in May and June of 1977. The BET
norming data were collected from over 14,000 fourth, fifth, and sixth graders in
56 classrooms in 23 states in May 1979. Thus, large sample sizes were used and
the data were collected recently. Given the spread of the sampling across
states and educational institutions, we also have some information that the test
developers tried to obtain a representative sample of students for each level,
although without random sampling procedures we may never be as certain of this
judgment as we might wish to be. Detailed test procedures and interpretation
information are also contained in each published test manual.

The quality of the norming data is critical to comparisons made with test
scores, either for groups or individuals. With norming data we can convert a
raw score to a percentile rank based on the use of the norming sample data. To
illustrate, if a researcher found that a class of twelfth graders received a
mean score of 26 on the TEL (form A) after an instructional unit in economics,
then the researcher interprets the class performance as meaning that the class
performed as well or better than 50 percent of the norming sample of 12th
graders with economics instruction. Even if researchers had other data
available on group performance in similar classes, the norming data could still
be used as a basis for comparison for all classes and as a way to add meaning to
the interpretation of the test scores.
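The raw-score-to-percentile conversion can be sketched as below. The norming scores are hypothetical, not values from any published manual, and the "as well or better than" reading is implemented by counting norming scores at or below the raw score. Published norm tables may use a slightly different convention for ties.

```python
def percentile_rank(raw_score, norming_scores):
    """Percent of the norming sample scoring at or below raw_score."""
    if not norming_scores:
        raise ValueError("norming sample is empty")
    at_or_below = sum(1 for score in norming_scores if score <= raw_score)
    return 100.0 * at_or_below / len(norming_scores)


# Hypothetical norming sample of ten student scores.
hypothetical_norming_sample = [12, 15, 18, 20, 22, 24, 26, 28, 31, 35]

class_mean_rank = percentile_rank(26, hypothetical_norming_sample)
print(f"percentile rank of a score of 26: {class_mean_rank:.0f}")
```

The result describes standing relative to the norming group only, which is the caution the next paragraphs raise.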

When using norms researchers must remember that the norms were developed
from the norming sample. Norms are not statements of what should be or ought to
be; they should not be viewed as standards. Norms are simply a large data set
available for comparison purposes. This comparison is not with all students or
classes at that age (grade) level, just the norming group.

The age of the norming data also becomes more critical over time. The
older a test, the greater the probability the norming sample scores are outdated
for comparison purposes and reliability and validity data are more suspect.
This problem is mentioned because test revision in economic education has not
been frequent. The norming data for the "new" TEL is now over 6 years old and
may soon be in need of revision, but a 15 year period lapsed before the TEL
replaced the old TEU. The developer of the RTUCE also recognized the time
problem and recommended that the RTUCE tests "be revised more frequently than
the 12 years that elapsed between publication of the original TUCE and the
current revision" (Saunders, 1982, p. 10). As stated earlier, empirical
research is influenced by the quality of the data collected by the major test
instrument, and consequently, we all have a stake in test instrument development
even if we do not do the measurement work. Timely revision of major measures is
essential for research work in this field.

Item Analysis 

Data on the difficulty level of each item are provided in the BET, TEL, and
RTUCE manuals. (Difficulty level refers to the percentage of students in the
norming sample who got the item right.) In addition, data are presented on the
discriminating power of each item, or the ability of an item to distinguish
between students who do well on the test and those who do not. With the RTUCE,
for example, the discriminating power is measured with a point biserial
correlation between the mean score of those giving a correct response on an item and
the mean score of the total norm group for that test.
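
The two item statistics can be illustrated with a short Python sketch. It assumes dichotomously scored items (1 = correct, 0 = incorrect), and the response matrix is hypothetical. The point biserial is computed from the mean total score of those answering the item correctly and the mean total score of the whole group, which is algebraically the same as the Pearson correlation between the item and the total score.

```python
from math import sqrt
from statistics import mean, pstdev


def item_difficulty(responses):
    """Proportion of students answering each item correctly."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students for j in range(n_items)]


def point_biserial(responses):
    """Point biserial correlation of each item with the total score.

    Assumes every item has some correct and some incorrect answers
    and that total scores are not all identical.
    """
    totals = [sum(row) for row in responses]
    m_t = mean(totals)    # mean total score, whole group
    s_t = pstdev(totals)  # population SD of total scores
    result = []
    for j, p in enumerate(item_difficulty(responses)):
        # Mean total score of students who got item j right.
        m_p = mean(t for row, t in zip(responses, totals) if row[j] == 1)
        result.append((m_p - m_t) / s_t * sqrt(p / (1 - p)))
    return result


# Hypothetical scored responses: 6 students (rows) by 4 items (columns).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
difficulty = item_difficulty(scores)
discrimination = point_biserial(scores)
```

In this sketch the item itself is included in the total score; a test manual may report a corrected version that excludes it.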

While item data may be of interest to instructors who wish to evaluate
student performance on particular items against the norming group, item data are
usually of little interest to researchers. There may be items that researchers
do not like or do not think show sufficient difficulty or discriminating power.
A test, however, is an index, and what we need to know is whether this index is
an adequate measure of the construct under study. This quality is most properly
assessed by the reliability and validity characteristics. Remember the maxim:
judge the overall test, not individual items.

Standardized versus Teacher-made Tests

Reliability and validity have been called the "meat and potatoes" of
educational measurement. In the previous discussion we identified what to look
for in the reliability and validity information with a standardized achievement
test in economics. In certain areas the RTUCE, TEL, and BET offer the
researcher only limited information on the major technical features, most
noticeably with equivalent-forms reliability estimates. Since measurement and
test instruments lay the foundation for research work, we must continue to
increase the amount of reliability and validity data to maintain high standards
for research.

No suggestion is being made that these measures not be used because of the
lack of complete information. The RTUCE, TEL, and BET are the best available
instruments for research and are of good quality. The test development process
is also an arduous one. Researchers who ventured into this area and produced
instruments of the quality of the RTUCE, TEL, and BET, given time and resource
constraints, are to be applauded for their labors, which resulted in a long-run
contribution to the field.

We should also consider the alternative to standardized tests: teacher-made
tests. The basic differences between the two types of tests should be reviewed
before the use of a standardized instrument is rejected in favor of a "home-made"
substitute. As we have illustrated, items for standardized measures are
carefully written, pretested, and selected by a committee of experts; teacher-made
test items are not constructed with the same level of quality.
Standardized tests also provide a manual with detailed reliability and validity
data, norms for comparison purposes, and clear test procedures. Information on
the technical characteristics and test procedures is often unknown or
unpublished with teacher-made tests. A standardized test can be used for
comparison and research purposes; the classroom test is only sufficient for
evaluating individual student performance in a particular classroom, and it is
rarely of acceptable quality for use in research. In short, a teacher-made test
is not likely to inspire much confidence in the results of a study where it
is used, and researchers should have good grounds before rejecting the use of a
standardized test.



A Modified Test: An Example

A teacher-made test may be appropriate when no standardized test is
available for use with the group under study or when there are limits to the
test period. Even in these situations researchers are better off searching for
a previously developed instrument as a source for items with some reliability
and validity information.

For example, in an evaluation study of an elementary school program,
Walstad (1979) used a 29-item version of the 40-item Test of Elementary
Economics (West Springfield, 1971) as the evaluation instrument since the BET
had not been developed at the time the study was conducted. A shortened form of
the TEE was required for several reasons. First, the TEE was originally
developed for use with sixth graders and in the study the target group was
fourth graders. Second, a shortened instrument was needed to fit the limited
classroom testing period. Third, the norming data from over 2,500 elementary
students in New England were dated and test reliability needed to be checked.

When the TEE was administered to a separate sample of 63 fourth, fifth, and
sixth graders in a local school district, the reliability (KR-20) was a low
.53. Eleven items had difficulty levels (percent correct) below chance (.25) or
had negative high-low discrimination (percentage difference between the highest
and lowest scoring groups for the correct alternative). By shortening the
length of the test from 40 to 29 items, the reliability actually increased
rather than decreased and was a modest .65. The item mean difficulty level was
.43 and item mean discrimination level was .36. After the separate sample work,
the reduced TEE was used in the study and showed an internal consistency
reliability of .71, which was acceptable for research. If the more difficult
and poorly discriminating items of the original TEE had not been eliminated,
the reliability estimates would no doubt have been much lower.
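
The screening rule described above (drop items with difficulty below chance or with negative high-low discrimination) can be sketched as follows. The response data are hypothetical, and the use of the top and bottom 27 percent of scorers as the high and low groups is an assumption of this sketch, since the text does not state the group size used. The Spearman-Brown formula is added only as a point of reference: it projects the reliability a test would be expected to have if it were shortened by removing items of average quality.

```python
def high_low_discrimination(responses, fraction=0.27):
    """Proportion correct in the top group minus the bottom group, per item.

    `fraction` (share of students in each extreme group) is an assumed
    convention, not a value taken from the study.
    """
    ranked = sorted(responses, key=sum)  # stable sort by total score
    k = max(1, round(fraction * len(ranked)))
    low, high = ranked[:k], ranked[-k:]
    n_items = len(responses[0])
    return [
        sum(row[j] for row in high) / k - sum(row[j] for row in low) / k
        for j in range(n_items)
    ]


def flag_items(responses, chance=0.25, fraction=0.27):
    """True for items below chance difficulty or with negative discrimination."""
    n_students = len(responses)
    disc = high_low_discrimination(responses, fraction)
    flags = []
    for j, d in enumerate(disc):
        difficulty = sum(row[j] for row in responses) / n_students
        flags.append(difficulty < chance or d < 0)
    return flags


def spearman_brown(reliability, length_ratio):
    """Projected reliability when test length is multiplied by `length_ratio`."""
    return length_ratio * reliability / (1 + (length_ratio - 1) * reliability)


# Hypothetical responses: 8 students (rows) by 3 items (columns).
responses = [
    [1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0],
    [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 0],
]
flags = flag_items(responses)

# Shortening a 40-item test with reliability .53 to 29 average items.
projected = spearman_brown(0.53, 29 / 40)
```

Under these assumptions the projection is about .45, so a rise to .65 after shortening indicates that the deleted items were weaker than average rather than a random subset.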



The improvement in reliability of the TEE by omitting items did not appear
to come at the expense of content validity. A comparison was made between the
original content matrix and the reduced content matrix for the TEE. All content
areas in the original test were still represented in the reduced test. The
items eliminated were basically ones of a [illegible in source] to the study,
or else the items duplicated material covered by other items. The
reduced test represented the best available instrument to test the general
understanding and application of economic concepts likely to be taught to
classes at these grade levels. (See Table 3 for test matrix comparisons.)



Insert Table 3 about here



Conclusion on Teacher-made Tests

In instances where no previous instruments are available to offer guidance
or test questions, then researchers must start from "scratch." Guidelines for
good test construction can be found in most measurement texts. Ideally, the
test should be pretested with a separate sample before it is used and
information made available in the research report on the descriptive test
statistics, reliability estimate, and how validity was determined. Teacher-made
tests can be made acceptable for research (at least for exploratory work) as
long as we have adequate documentation of the technical characteristics so we
can judge the quality of the measure.

One educational researcher has recommended that "editors and reviewers
ought to routinely return papers that fail to establish psychometric properties
of the instruments they use" (Willson, 1980, p. 9). This standard is a strict
one and if it were applied to cognitive measures used in recent studies
published in the Journal of Economic Education, then a number of studies would
be returned for lack of complete information. For example, studies by Ferber,
et al., (1983); Paul (1982); and Swartz, et al., (1980) all use a teacher-made
multiple choice economics test as a measure of output, but no study provides any
reliability data on the developed measure. In addition, the discussion of test
validity is limited since we are not told what content areas were covered by the
test. In each of these studies, for example, TUCE items were included as part
or all of the teacher-made tests, but we are not told which items were
selected. In other words, the reliability and validity information on the
instruments is incomplete and we are forced to accept the opinion of the
researchers that the measures are sound. When teacher-made or modified
standardized tests are used, then more test information is required because we
have no test manual to consult.



Affective Instruments

Affective instruments have been incorporated in economic education research
from the first issue of the Journal of Economic Education. Every issue
thereafter usually contains one or more articles with an affective measure as an
input or output variable in model specifications. Constructs which have been
examined in the research literature include: attitudes towards methods of
instruction (McConnell and Lamphear, 1969); economic attitude sophistication
(Mann and Fusfeld, 1970); student course evaluations (Villard, 1973); attitudes
towards economics (Karstensson and Vedder, 1974); attitudes towards economic
issues (Riddle, 1978); and learning and instructional styles (Miller, 1981). A
review of the findings on the significance of some of these affective variables
at the college level is found in Siegfried and Fels (1979, pp. 930-937).



The Neglect of Documentation

Paradoxically, while there has been great interest in affective instruments
and recognition of the need to use normed, reliable, and valid measures of
cognitive achievement, little attention has been paid to the measurement
qualities of the affective measures. It appears to be acceptable practice in
economic education to develop an attitudinal measure and use it in research
without documenting its technical characteristics. Henry and Ramsett, for
example, examined the effects of computer-aided instruction on learning and
attitudes in principles courses. All we are told about the attitude towards
economics measure is: "[...] score was obtained by having students complete an
attitude test at the end of the course" (p. 28). This example may seem
extreme. It is not, but the impulse to list the studies in economic education
that provide no documentation of the characteristics of the measure used or that
provide incomplete documentation will be resisted and the task left as an
exercise for the reader.



There is no substitute for an exact description of the affective measure
with information on how the instrument was developed, the validation procedures,
its reliability, and the samples to which it has been administered by the
original developer. We would also want at least an estimate of the internal
consistency reliability with the group under study. Devoting one or two
paragraphs or lengthy footnotes to the description of measures in an article is
not too much to ask of researchers. Since the conclusions are ultimately based
on the quality of the instruments used, this consideration should be sufficient
justification for substantiating the value of the instruments. In addition,
affective measures are often viewed as "softer" than cognitive measures, and it
might be expected that more "rigorous" documentation would be both desired and
required of affective measures before they are used. The reverse has been the
case to date: higher measurement standards are found with cognitive rather than
affective measures, on average.

Criteria for Affective Measures



Basically, the same evaluation criteria apply to both cognitive and
affective measures. When considering validity, it is necessary to look at the
three kinds of validity evidence: content, criterion-related, and construct.
Achievement tests usually give most weight to content validity, but with
affective measures the emphasis shifts to documentation of construct validity.
Also, criterion-related validity can be even more difficult to determine in the
affective domain than in the cognitive domain because there are even fewer
suitable criterion instruments. Trying to predict behavior from responses to
self-report attitude inventories has not met with much success. Despite these
problems, we will seek evidence in all three ways to support our contention that
the instrument measures what it purports to measure.

Information on the reliability of the affective measure also should be
reported by researchers. In most cases, affective measures require only the
reporting of internal consistency reliability. Estimates for stability through
the use of a test-retest procedure may not be appropriate because of the
reactive nature of the measures, and parallel forms are rarely available for
affective measures. The reliability range for affective measures is often lower
than for achievement measures, making a minimum standard difficult to specify,
but any affective measure with a coefficient of .60 or greater is probably
acceptable for research work.



As was the case with cognitive tests, the norming sample for affective
measures should also be large and representative of the population under study
so we have some assurance that the technical property of reliability (or
validity) is being estimated with an appropriate group. Affective measures also
need to be revised on a timely basis to maintain the value of the norms. Since
the scores for an affective instrument are summations across different items,
researchers should eschew item analysis and concentrate on the meaning and
interpretation of the overall score.



The SEA: An Example

The only affective measure for economic education which approaches the
standards of such cognitive measures as the RTUCE, TEL, or BET is the two-part
Survey on Economic Attitudes (SEA) consisting of 28 Likert-type statements
(Walstad and Soper, 1983; Soper and Walstad, forthcoming). The first part of
the SEA assesses attitudes towards economics (ATE); the second part of the SEA
examines economic attitude sophistication (EAS), or the degree to which
student views are in agreement with the consensus views of economists on
economics issues. The instrument was nationally normed with a group of about
1,[?]00 high school students (11th and 12th graders) in 67 schools in 35 states in
May 1979. Small sample work also indicates that the instrument (at least the
ATE) may be suitable for use at the college level.

A detailed description of the reliability and validity work for the SEA is
provided in the previously cited work and will only be briefly described here.
The Cronbach alpha was .88 for the ATE and .66 for the EAS with the large high
school sample. Similar estimates for each instrument were obtained with the
college samples. Although the alpha for the EAS is somewhat lower than the ATE,
this difference is probably due to the difficulty of obtaining internal
consistency when assessing attitudes on diverse economic issues and to the short
length of the measure. Both ATE and EAS estimates, however, meet or exceed
standards for research use.
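
For readers who want to estimate internal consistency with their own sample, coefficient alpha can be computed directly from item-level scores. The sketch below is illustrative only: the data are hypothetical, and population variances are used throughout so that the KR-20 form (which replaces each item variance with pq) gives the same value as alpha for dichotomously scored items.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha; rows are respondents, columns are items."""
    n = len(scores[0])
    sum_item_var = sum(pvariance([row[j] for row in scores]) for j in range(n))
    total_var = pvariance([sum(row) for row in scores])
    return n / (n - 1) * (1 - sum_item_var / total_var)


def kr20(scores):
    """KR-20 for items scored 0/1: item variance is p * q."""
    n = len(scores[0])
    n_resp = len(scores)
    p = [sum(row[j] for row in scores) / n_resp for j in range(n)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    total_var = pvariance([sum(row) for row in scores])
    return n / (n - 1) * (1 - sum_pq / total_var)


# Hypothetical dichotomous responses: 6 respondents by 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
alpha = cronbach_alpha(scores)
```

The same `cronbach_alpha` function applies unchanged to Likert-type items scored on a multi-point scale; only `kr20` requires 0/1 scoring.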

Reliability is only a necessary condition for validity, so an in-depth
investigation was made of the content and construct validity of each measure. A
working committee reviewed the topics to be included in each attitude measure
and received feedback from a national advisory committee to select items and to
judge overall content validity. Construct validity evidence was first obtained
for the ATE and EAS by statistically testing for the expected differences in
responses among known groups: high school students, introductory economics
students, advanced undergraduates, and college professors. Student scores on
the ATE and EAS were also correlated with scores on other measures (an IQ test,
the RTUCE, and the ACT) to examine whether each measure showed a degree of
uniqueness. In addition, for the EAS, a survey test was conducted as part of
construct validity work to help identify the "consensus" position of economists
and economic educators on economic issues.

The SEA is just one example of a nationally normed affective measure for
economic education research. The SEA is not a "perfect" measure and obviously
more information on its technical characteristics may be desired by users. The
development of the SEA illustrates the extensive work necessary to document what we
are measuring and how well we are measuring it and should represent a distinct
improvement over the ad hoc development of most affective instruments in
economic education.



Conclusion



Although measurement is central to the research process in economic
education, the topic is neglected or given improper treatment in much research
work. Maybe this attitude toward measurement is due to the excitement
experienced by researchers in other phases of work. Somehow worrying about how
we are measuring important input or output variables is just not as exciting or
glamorous as the formulation of research hypotheses, the development of the
research design, or the distillation of statistical results into general
conclusions. Or, perhaps the omission is due to the strong economics influence
on economic education; most researchers are trained in econometrics, not in
psychometrics. In economics research, the data (e.g., GNP, CPI, or retail sales
figures) are usually collected by other organizations and individuals. Economic
education research, on the other hand, requires the development or selection of
measures and data collection by researchers. Few national data sets are
available for researchers to use which contain reliable and valid data on
variables of research interest. Consequently, attention to the technical
properties of instruments used to collect the data is essential for a sound
empirical study in economic education.

An analogy drawn from home economics rather than economics illustrates the
problem. Many people enjoy baking, from selecting the recipe, to combining
ingredients, to drawing conclusions about the final output. Measuring the
ingredients, which is a necessary part of the culinary process, is not as exciting
or satisfying as the other phases of the experiment. But imagine what would
happen when quantities are estimated with invalid measures, or if unreliable
measures were used to determine the amount of ingredients, or if the baker knew
nothing about the measurement process. Then, conclusions drawn about the
finished product and the experiment itself would change drastically. The
analogy, as simple as it is, highlights the research problem of neglecting
measurement concerns.

At present we only have a handful of valid and reliable instruments for
research work. If we are to make more progress in exploring the dimensions of
economics learning, then we will need new measures and we will need to
revise the old ones on a timely basis. Becker (Winter, 1983) recognized this
problem in a recent review of economic education research:

    The fact that appropriate cognitive and affective
    instruments do not exist for a specific assessment task suggests
    that we should attempt to develop such instruments. Reliable and
    valid test instruments for all forms of learning are needed. These
    instruments must measure what they purport to measure and do it
    consistently across individuals and over time. Without valid and
    reliable instruments, it is impossible to tell what is being
    measured and to make comparisons to assess results (p. 15).



A career could be made in economic education developing the needed instruments,
and although the test development process is becoming more sophisticated, the
opportunities for making a solid contribution to the field are great. Future
progress in research requires this specialized work.

1. For researchers interested in criterion-referenced tests and the
discussion of reliability and validity as they apply to these measures,
see Popham (1981). Some of the points discussed in this chapter are also
presented in Walstad and Buckles (1983).

2. There are other nationally normed economics tests produced by the Joint
Council on Economic Education. These are the Junior High School Test of Economics
(JHSTE) and the Test of Understanding in Personal Economics (TUPE). The JHSTE
was normed in 1973 and the TUPE in 1970. They were omitted from the discussion
due to their age. A new economics test for the Give and Take series is soon to
be released, and will not be discussed since it was developed for a specific
economics program. The CLEP is available from the Educational Testing Service,
but is expensive to use in research work.

^ • . ■ v . ' , , 

3. Coefficient or Cronbach alpha is the basic formula for internal
consistency. When test items are dichotomous the KR-20 formula can be used. In
this case, Cronbach (1951) has shown that the KR-20 and alpha estimates are
equivalent. The formula for coefficient alpha is:

$$\alpha = \frac{n}{n-1}\left(1 - \frac{\sum V_i}{V_t}\right)$$

where $n$ = number of items on the test; $V_t$ = variance of the total test; and
$\sum V_i$ = the sum of the variance of individual items. The only difference between
alpha and the KR-20 formulas is that $\sum V_i$ is replaced by $\sum pq$, where $\sum pq$ = sum of
the variance of items scored dichotomously.

4. In all fairness, internal consistency estimates are still sufficient for
most research studies since the major source of the measurement error is
probably due to item sampling. A KR-20 or Cronbach alpha estimate is also
valuable because it sets an upper bound to the reliability of a test. If this
estimate is not high, then the other types of reliability estimates (equivalent
forms) are likely to be even lower (Nunnally, 1978, p. 231).



5. Another problem with internal consistency estimates is that they may be
inflated if the test becomes a speed test rather than a power test. A power
test allows sufficient time for all students to complete a test but a speed test
does not. The RTUCE, TEL, and BET are designed as power tests, but no data are
presented which indicate that the time period specified in the manual is
sufficient for all students, so the reliability estimates may be inflated.

6. A number of factors can influence the reliability estimates of a test.
These include test length, spread of the scores, the difficulty of the test, the
objectivity in scoring, and the type of estimating procedure (see Gronlund, pp.
104-111).

7. Whether this measurement error presents a problem for the statistical
estimation depends on the methods used. (See Becker, Summer 1983, pp. 6-7.)

8. As Wolf (1982) notes: "Validation of a particular test usually requires
an integration of all three types of evidence, and one cannot freely be
substituted for another. . . . There is a move towards viewing validity as a
unitary rather than a tripartite concept" (p. 1995).

9. There are different rationales for the use of standardized and classroom
tests. For a discussion of these points, see Becker and Walstad (1981).

10. So far we have discussed only paper-and-pencil cognitive measures. Other
types of cognitive measures which do not rely on paper-and-pencil reactions may
be developed (e.g., observations of student economic behavior). These measures
must also be shown to be reliable and valid before they are used for research
work. The psychometric data may be more difficult to collect than it is for a
multiple-choice standardized test, but the general standards still apply.

11. Even studies which use a standardized measure, such as the TUCE, fail to
report what form of the test was used (A or B), how they were used (pretest and
posttest), or report any reliability data on the use of that instrument with the
sample under study.

12. The same conclusion applies to non-paper-and-pencil affective measures.
While it may be difficult to provide extensive reliability and validity
information, this data must be furnished if we are to have any insight into what
and how well the behavior is being measured. Observations and ratings are
riddled with measurement error and invalidity. (See Nunnally, 1982, pp. 1596-
1601.)
TABLE 1

Types of Consistency

Reliability Property                     Consistency Consideration
(Method)                         Test         Constancy of    Over Different
                                 Procedure    Response        Samples of Items

Stability
(test-retest over time)

Equivalence
(equivalent-forms,
no time interval)

Stability and Equivalence
(equivalent-forms
with time interval)

Internal Consistency
(KR-20 or Cronbach alpha)

[Cell entries marking which types of consistency each method reflects are
illegible in the source.]

*Short-term constancy of response may be reflected but not day-to-day constancy.

Adapted from Gronlund (197[?], p. 101).
TABLE 2

Test Specification Matrices for RTUCE (Form A)

Cognitive categories (columns): Recognition & Understanding | Explicit
Application | Implicit Application | No. of Questions. Item numbers are listed
for each content category in the order they survive in the source.

Macro Form A

A. Measuring Aggregate Economic Performance:
   7, 16 | 21X | [remaining entries illegible]
B. Aggregate Supply, Productive Capacity, and Economic Growth:
   5, 28 | 8X, 26 | 11X | 5
C. Income and Expenditure Approach to Aggregate Demand and Fiscal Policy:
   13, 22, 23, 27 | [illegible], 19 | [remaining entries illegible]
D. Monetary Approach to Aggregate Demand and Monetary Policy:
   12, 17, 24X | 3, 6, [illegible] | 20X | 7
E. Policy Combinations and Practical Problems of Stabilization Policy:
   10, 25 | 16X, 18X, 29X, 30X | [remaining entries illegible]
No. of Questions:
   10 | 10 | [illegible] | 30

Micro Form A

A. The Basic Economic Problem:
   [illegible], 4 | 13, 16X | [remaining entries illegible]
B. Markets and the Price Mechanism:
   6X, 28X | 2, 27 | [remaining entries illegible]
C. Costs, Revenue, Profit Maximization, and Market Structure:
   22X, 25X | [remaining entries illegible]
D. Market Failure, Externalities, Government Intervention and Regulation:
   [illegible], 24 | 15X, 21 | 17X, 20
E. Income Distribution and Government Redistribution:
   10 | 19X, 23X, [illegible] | 3, 12X
No. of Questions:
   10 | 10 | 10 | 30

Table 2 is from P. Saunders (1981, [page reference illegible]).
TABLE 3

Test of Elementary Economics Matrix
(original versus modifications*)

Columns: Concept Area | Knowledge Questions (Fact, Definition) |
Comprehension Questions | Application Questions

Concept areas (rows): Household; Business; Government; Exchange;
Technology; Market; National economy

[The item numbers in each cell, and the underlining that marks omitted items,
cannot be reliably aligned from the source.]

*Underlined questions were omitted.